home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
MacWorld 1999 July
/
Macworld (1999-07).dmg
/
Shareware World
/
Info
/
For Developers
/
Mops 3.4.sea
/
Mops source
/
PPC source
/
cg1
< prev
next >
Wrap
Text File
|
1999-01-14
|
37KB
|
1,125 lines
marker m__cg1
\ =========================================
\ POWERPC CODE GENERATOR
\ =========================================
\ NOTE: all the technical documentation is at the end of this file,
\ after the endload.
true constant assertions?
\ make it false when everything really works?!?
PPC?
[IF]
false constant debug?
[ELSE]
false constant debug?
[THEN]
false constant 64bit?
\ =============== OPTIMIZATION FLAGS ===============
(* it's useful to be able to turn these off, to find those
recalcitrant bugs - and in the case of range checking,
to be able to turn it off for speed-critical tested
code. Or for floating multiply-add, someone might want to
turn it off to restore strict IEEE FP semantics.
NOTE: the default for all these boolean values is TRUE.
*)
true value hoist? \ can we move ops out of loops if
\ they're invariant in the loop?
true value hoist_fetches? \ can we move fetches to as early
\ as possible?
true value allow_match? \ true if we can eliminate an op if
\ we find the result already in
\ a register. Basically the same
\ as CSE (common subexpression
\ elimination).
true value move_by_recompiling? \ can we try to recompile an earlier
\ op to avoid a reg move?
true value optimize_leaf_calls? \ do we use the fast calling sequence
\ for leaf definitions?
true value optimize_branches? \ do we simplify branches over other
\ branches, branches to the next
\ instruction, etc.?
true value cascade? \ do we combine ops into single
\ instructions where we can?
true value range_check? \ do we range-check array accesses?
true value multiply-add? \ do we cascade a floating multiply
\ followed by an add into a
\ floating multiply-add instruction
\ (or one of its variants)?
\ =============== FORWARD DEFNS ================
forward .G
forward .F
forward .C
forward .GS
forward .CS
forward .AL
forward .FAL
forward .FR
forward ZS
\ =============== OTHER GLOBALS ==============
0 value backstop_CDP
0 value fetch_backstop
0 value basic_block_start
0 value loop_start
0 value max_called_#PL
0 value max_called_#FPL
false value equalizing?
false value eq_block_recompiling_move?
bytestring eq_ranges
bytestring const_data
bytestring sv_const_data
0 constant CR0
4096 constant sys_SP_framesize
8 constant FPcell
(* We don't move the stack pointers every time we push and pop something
- rather, we keep track of the accumulated offset here and only adjust
when we have to.
*)
0 value STK_OFFSET
0 value FSTK_OFFSET
\ ================= OD FIELD VALUES ==================
\ Here we define various values for OD fields. We generally use the same
\ names as for the 68k code generator, although the values aren't all
\ the same (which doesn't matter anyway).
\ Mode values:
enum { mdGPR mdBD mdAbs mdLit mdPC mdFPn }
PPC? [IF] hexx [ELSE] hex [THEN]
\ Flag byte bits:
\ 0 constant flExt \ Sign extend
1 constant fbExt
\ 1 constant flFP \ Floating operation
\ 2 constant fbFP
(*
2 constant flLit \ Floating Literal
4 constant fbLit
3 constant flFCR \ FPU constant ROM reference
8 constant fbFCR
*)
\ Operation type values (opType):
\ 0 means empty, 1 to 7 mean not empty but unknown for some reason.
\ So far we've only defined 1-3.
1 constant otUnknown \ isn't empty, but we don't know anything about
\ the contents
2 constant otUnkStored \ ditto, but we've stored it and so might be
\ able to optimize a fetch from the same
\ location, even tho we don't know its type
3 constant otUnkPulled \ ditto, pulled from memory part of stack
7 constant otUnknownCodes \ all codes less than or equal to this are
\ some kind of empty/unknown
8 constant otMove \ reg move
\ the following are also on the 68k, and in target compilation they
\ dispatch to us from the 68k interpreter using these codes, so we
\ can't change them!
12 constant otMUL
13 constant otDIV
14 constant otUDIV
\ the following are PPC only:
10 constant otMULH \ multiply high
11 constant otUMULH \ multiply high
16 constant otAddc
\ 17 constant otAddic
17 constant otAdde
18 constant otAddze
19 constant otAddme
1A constant otSubfc
\ 1C constant otSubfic
1B constant otSubfe
1C constant otSubfze
1D constant otSubfme
\ these following are also 68k codes, and can't change
21 constant otADD
22 constant otSUB
23 constant otAND
24 constant otOR
25 constant otXOR
26 constant otCMP
27 constant otUCMP \ unsigned compare - PPC only
28 constant otNEG
29 constant otNOT
2A constant otShift
2B constant otShift&mask \ PPC only
2C constant otTrap \ ditto
30 constant otPMend \ End of integer ops
\ 3F constant otFPcmp \ Floating-point comparison. A special case.
40 constant otFPstart \ Start of regular floating-point ops. Note
\ these are NOT in the same order as the
\ integer ops.
40 constant otFMOVE
41 constant otFADD
42 constant otFMUL
43 constant otFMADD \ multiply-add - PPC only
44 constant otFNMSUB \ negative multiply-subtract - AltiVec only
48 constant otFPnoncom \ The following FP ops are non-commutative
48 constant otFSUB
49 constant otFDIV
4F constant otFPcmp \ FP comparisons
160 constant otFPstore
162 constant otFPfetch
54 constant otFPmon \ The following FP ops are monadic
(* we don't bother including these in the dictionary, but these
are the values:
54 constant otFABS
55 constant otFNEG
56 constant otFSIN
57 constant otFCOS
58 constant otFTAN
59 constant otFATAN
5A constant otFSQRT
*)
5F constant otFPend \ End of FP ops
60 constant otStore \ Store
62 constant otFetch \ Direct fetch
\ otFetch
\ constant otAt \ Indirect fetch - ends up being treated
\ the same
62 constant otDUP \ Stack shuffling
63 constant ot2DUP
64 constant otDROP
65 constant ot2DROP
66 constant otSWAP
67 constant otOVER
(* again we don't bother including these in the dictionary, but
these are the values:
68 constant otNIP
69 constant otTUCK
6A constant otROT
6B constant otDOWN
6C constant ot2SWAP
For FP stack shuffling, we'll use the corresponding
opcodes in the range 72 - 7B.
We'll use opcodes in the range 80 - A0 for unique AltiVec ops like
vector splat that have no scalar equivalents:
80 vector splat
81 vector splat immediate
90 vector select
91 vector permute
*)
200 constant otVecOffset \ all vector ops have this bit set so we
\ can recognize them quickly.
\ Subtype byte values:
\ For comparisons, the top 4 bits of the byte give the condition register field
\ bit which we need to branch on for this condition: LT = 0, GT = 1, EQ = 2.
\ The low bit (bit 7) is 1 if the condition we want has a field bit value of 1.
\ Bit 6 is 1 for unsigned, and bit 5 is 1 for comparisons with zero.
20 constant cmpNE
21 constant cmpEQ
00 constant cmpGE
01 constant cmpLT
10 constant cmpLE
11 constant cmpGT
02 constant cmpHS
03 constant cmpLO
12 constant cmpLS
13 constant cmpHI
24 constant cmpZNE
25 constant cmpZEQ
04 constant cmpZGE
05 constant cmpZLT
14 constant cmpZLE
15 constant cmpZGT
\ here's a table to map our 68k comparison codes to the above PPC ones:
PPC? [IF]
createx comparison_codes
0 cx, \ 0
0 cx, \ 1
cmpHI cx, \ 2
cmpLS cx, \ 3
cmpHS cx, \ 4
cmpLO cx, \ 5
cmpNE cx, \ 6
cmpEQ cx, \ 7
0 cx, \ 8
0 cx, \ 9
0 cx, \ A
0 cx, \ B
cmpGE cx, \ C
cmpLT cx, \ D
cmpGT cx, \ E
cmpLE cx, \ F
decimalx
[ELSE]
create comparison_codes
0 c, \ 0
0 c, \ 1
cmpHI c, \ 2
cmpLS c, \ 3
cmpHS c, \ 4
cmpLO c, \ 5
cmpNE c, \ 6
cmpEQ c, \ 7
0 c, \ 8
0 c, \ 9
0 c, \ A
0 c, \ B
cmpGE c, \ C
cmpLT c, \ D
cmpGT c, \ E
cmpLE c, \ F
decimal
[THEN]
0 value OPERATION
0 value SUBOPERATION
\ Some types of instruction need special treatment - e.g. AND etc. use the
\ rA field for the destination. So we define some types:
enum { noType loadStoreType arithType logicalType cmpType branchType shiftType vecType }
\ ================= UTILITY WORDS ==================
: MONADIC? \ ( opcode -- opcode b )
otNeg otNot within? IF true EXIT THEN
otFPmon otFPend within?
;
: MEM_REFERENCING? \ ( opcode -- b )
$ FF and
SELECT[ otFetch ],
[ otStore ]=> true
DEFAULT=> drop false
]SELECT
;
PPC?
[IF]
: dasm ; \ can't disassemble in native mode yet!
: z ;
[ELSE]
forward dasm \ disassembles what we've done so far
forward dcurr \ disassembles the current defn (even if not finished)
forward Z \ ends ppc compilation and disassembles.
[THEN]
\ GetImmediateOp does the same when we're going to execute the operation
\ now, returning the xt of the word to execute.
: GETIMMEDIATEOP { opType subtype -- xt }
opType
SELECT[ otAdd ]=> ['] +
[ otSub ]=> ['] -
[ otAND ]=> ['] and
[ otOR ]=> ['] or
[ otXOR ]=> ['] xor
[ otMUL ]=> ['] *
[ otDIV ]=> ['] /
[ otNEG ]=> ['] negate
[ otNOT ]=> ['] not
[ otCMP ]=>
subtype
CASE[ cmpNE ]=> ['] <>
[ cmpEQ ]=> ['] =
[ cmpGE ]=> ['] >=
[ cmpLT ]=> ['] <
[ cmpLE ]=> ['] <=
[ cmpGT ]=> ['] >
[ cmpZNE ]=> ['] 0<>
[ cmpZEQ ]=> ['] 0=
[ cmpZGE ]=> ['] 0>=
[ cmpZLT ]=> ['] 0<
[ cmpZLE ]=> ['] 0<=
[ cmpZGT ]=> ['] 0>
DEFAULT=> cr .h ." undef otCMP subtype in getImmediateOp" 1 die
]CASE
[ otUCMP ]=>
subtype
CASE[ cmpHS ]=> ['] u>=
[ cmpLO ]=> ['] u<
[ cmpLS ]=> ['] u<=
[ cmpHI ]=> ['] u>
DEFAULT=> cr .h ." undef otUCMP subtype in getImmediateOp" 1 die
]CASE
[ otShift ]=>
subtype
CASE[ 0 ]=> ['] <<
[ 1 ]=> ['] >>
[ 3 ]=> ['] a>>
DEFAULT=> cr .h ." undef otShift subtype in getImmediateOp" 1 die
]CASE
DEFAULT=> cr .h ." undef op passed to getImmediateOp" 1 die
]SELECT
;
PPC? not
[IF] \ we put this in pnuc3 in target mode
\ 16bits? ( n signed? -- n b )
\ returns true if n will fit in 16 bits (signed or unsigned as requested).
: 16BITS? \ ( n signed? -- n b )
IF -32768 32767 within?
ELSE
dup 16 >> 0=
THEN
;
[THEN]
: SIGNED? \ ( operation - b )
\ Returns true if this is a signed op - assumed to be the
\ default. We just return false for the specific unsigned
\ ops, and true for everything else.
SELECT[ otUDIV ],
[ otAND ],
[ otOR ],
[ otXOR ],
[ otUCMP ]=> false
DEFAULT=> drop true
]SELECT
;
: REVERSE_COMPARISON \ called if we swap operands of a compare. Adjust
\ subOperation appropriately.
subOperation $ 24 and ?EXIT \ out if monadic or EQ or NE
$ 10 xor> subOperation ;
\ This must come before we redefine @ABS below!
PPC? not
[IF]
from pasmMod import{ :PPC_code ;PPC_code
disasm disasm_word disasm_xt
disasm_rng disasm_cnt disasm_one
set_disasm_call_range }
compile: pasmMod
\ While still in 68k-land, we need a PPC-style reloc! and @abs. The proper
\ PPC versions will be compiled in pnuc3 and setup respectively. Note,
\ any changes must of course be made in both places.
: RELOC! { theAddr dest -- }
theAddr addr>S&D
$ ffffff and swap 24 << or
dest !
;
: reloc, DP reloc! 4 ++> DP ;
: relocCode, CDP reloc! 4 ++> CDP ;
: displCode, CDP displ! 4 ++> CDP ;
\ In the target compilation, we only have 4 segments.
: @ABS { addr \ relocAddr seg# displ -- absAddr }
addr @ -> relocAddr
relocAddr $ ffffff and -> displ
relocAddr 24 >> -> seg#
seg#
SELECT[ 8 ]=> code_start
[ 9 ]=> data_start
[ 10 ]=> seg_code_start
[ 11 ]=> seg_data_start
DEFAULT=> ." not a reloc addr" 1 die
]SELECT
displ +
;
: @abs6 @abs ; \ a variant name we can call later when @abs
\ has been redefined in the PPC image
[THEN]
endload
=======================
TECHNICAL DOCUMENTATION
=======================
Here we describe all the nuts and bolts stuff relating to PPC code
generation. We've collected together a whole lot of commentary from
all over our source files -- it should be easier to find things if
they're all in one place, and we can also cut and paste it into the
manual.
The general way to generate optimized code for an architecture with n
registers is to analyse each basic block and generate a directed acyclic
graph (DAG) whose nodes represent each value generated and whose edges
represent uses of those values. Note it isn't a tree, since more than
one edge can go into a node. Then optimization can be done, and common
subexpressions (i.e. nodes) combined.
Then, using graph coloring or whatever, the nodes are assigned to
registers. This assumes there are less regs than nodes, so that we
have to find all dependencies, and re-use a reg when its old value
isn't needed any more.
Here we can simplify things a little, and manage everything in one pass.
On the PPC we have so many regs that for normal shortish basic blocks,
we will generally be able to simply assign a different reg for each node.
On the few occasions when we don't have a free reg, we can select one
which has a value with no outstanding references - there's a slight chance
we could have used this value in a subsequent op, but this is pretty well
negligible. We can optimize on the fly: each time we get a new value,
we can search the reg set for a match so we can use a combined node.
This code is invoked as follows:
Whenever Handlers is called to generate 68K code, if PPC? is true,
the Handlers selector and opcode is pushed and PPCvec is executed.
We will set PPCvec to point to PPC_compile which will use the selector
to dispatch to the right routine.
============= MEMORY ARCHITECTURE ================
On the PPC, code and data is normally kept separate, with the code
being read-only. The PEF format defines separate code and data sections,
and at launch time these are placed in pointer-based blocks in memory.
We ought to conform to this convention for installed apps, since it will
give the best performance.
In the development environment, our dictionary should be in a read-write
block for obvious reasons. We'll define a separate data area, which
will become the data section of an installed app. These two areas
can be in handles, which will allow us to resize them on the fly if
necessary.
At installation time, we'll generate a PEF in which the dictionary is
added to the existing code section, and the data area is added to the
data section. Our initial nucleus will probably be generated exactly
this way, either from the 68k or PPC version. Thus our initial
nucleus is simply an application, which transforms itself into the
Mops development environment by creating a handle for the new dictionary
and data areas. Now, we could allow the new dictionary area to link
itself to the old one by some cleverness, but since the dic would
only be getting split at one place I don't think it would be worth
the complexity. Better to just BlockMove the old dic area to the start
of the new when the dev environment is being set up, and likewise for
the data area.
Colon uses a new header format incorporating two flag bytes in addition to
what we use on the 68k, and also has to observe 4-byte alignment for the
code. For this reason we generate a full PPC-style header (including 4-byte
alignment) for colon definitions. When we compile a call to a colon defn we
have to do PPC-style alignment even on the 68k. For other types of definition
we don't have to bother, and the code should work on both platforms (since
on the PPC FIND will look after the extra alignment).
============= REGISTER DEFINITIONS: ================
We need 3 regs for scratch in boilerplate code sequences. We'll use
r0, which is a bit unusual anyway, and r11 and r12 (see below).
r1 is the stack pointer, r2 is RTOC. We can't monkey with these.
r11 and r12 are used in the calling sequence for external calls. Apple
says they can be used as scratch at all other times, so we'll use them
in boilerplate sequences. PowerMacForth does this too, and in the assembler
they're aliased as rX and rY.
For external calls, r3-r10 are used for parameters, and r3 for a returned
result. They won't be saved over the calls. Of course for internal Mops
calls we can do what we like. We can use these regs for general operands,
and on an external call normalize the stack so that the required number of
cells are stored in r3 on. At that stage we won't have any other cached
stack cells, so we don't need the regs preserved anyway.
This scenario gives us 8 regs for general operands, i.e. cached stack
cells (r3-r10), which should be enough. If it turns out not to be enough
we could grab a couple of the regs we've allocated for locals (see below).
r13-31 are "non-volatile" - they're saved over external calls. For
internal Mops calls we just need to save the locals, since other regs
like the base address regs don't get changed.
Now for the special regs we need. These all need to be saved over external
calls, and so are in the non-volatile block.
For addressing the dictionary, a difference to the 68k version is that
we need to keep code and data separate (see above). In the development
environment these will be in handle-based blocks, and in an installed
app they'll be defined in the PEF which will make them end up in pointer-
based blocks. Anyway, as long as we handle the addressability questions
properly, it shouldn't matter where they are.
Code references will be for branches, constants and other constant data
like literal strings and class info.
Constants can always be handled via literal instructions. Even if it
needs more than 16 bits it can be generated in a reg with 2 instructions
which will be faster than a memory reference.
Branches will always be self-relative, and they have enough displacement
bits to get us anywhere.
For other constant data, however, it's extremely handy to have a base
reg available, even if these references aren't performance-critical.
Thus, if we still have modules (which is still up for grabs), we'll
need 4 base regs - 2 to address code and 2 for data. Note that on
entry, r2 (RTOC) always points to the start of the data area as defined
in the PEF. But we can't use RTOC as a regular base reg, since in the
dev environment our data area will be off in a separate handle.
Other regs we need are RP (return stack pointer), the loop variable I,
and the base address of the current object. Now we may as well use one of
our "local" registers for I, since it will be very rare for us to need
all of them. This will mean one less local in definitions that use I,
but that's not a problem.
It seems we should go for a separate FP stack, since the Scientific
Library is now using this. This will also give better code on the PPC, since
when we bring stack cells back from mem to regs, we'll always know which regs
to move them to. The floating stack pointer probably should be in a
register.
So in all we'll need 7 special regs out of the non-volatile block.
This leaves r19-r31 for locals, which means that if we limit the number
of locals to 13, we can keep them all in registers. This looks reasonable.
Some notes on register handling across internal Mops calls:
Saving and restoring local regs can be a bit long-winded, so to save
space we should normally do what we do on the 68k - that is, at the
start of each defn, save whatever regs we need for locals, and restore
them at the end. For EXIT, instead of doing everything inline as we
did on the 68k, we'll do a branch to the semicolon. This is almost as
fast (esp. as it's an unconditional branch), and saves space.
We'll probably make an exception for leaf procs, since these get executed
so frequently. What I'm currently planning, in the case where the leaf
proc has named parms/locals, is to do the houskeeping in the calling
routine instead of in the called leaf proc. This will give me the
possibility of generating the parms straight in the required regs.
Also I might be able to do the saving and restoring of the needed regs
at the beginning and end of the calling routine (depending on what
parms/locals that routine might need. This would get these housekeeping
operations out of any inner loops.
This alternative calling convention should certainly be faster, but will
take a lot more space, so I won't do it all the time.
The 2 flag bytes are organized as 4 nybbles:
nybble 0: bit 0: 1 if it's a leaf defn
bit 1: 1 if it alters the count register
bit 2: spare
bit 3: 1 if it alters the FP or vector regs.
nybble 1: number of results in registers on exit
nybble 2: number of named parameters
nybble 3: number of named parms + locals
(doing it this way is a bit more convenient, and the
max number of parms+locals is only 11, so we have
enough bits)
If the definition does any floating point or AltiVec operations, or
anything that alters the FP or vector registers, we need more flags,
so we add an extra 32 bits. For alignment, we have to take at least
32, so we might as well make the most of it.
byte 0: $ BB - this is basically just for the disassembler.
byte 1: (if we end up with a vector stack, we might
use bytes 0 and 1 as 4 nybbles as for floats.)
bytes 2 and 3 are 4 nybbles:
nybble 0:
nybble 1: number of floating results in registers on exit
nybble 2: number of named floating parameters
nybble 3: number of named floating parms + locals
=============== FINALIZATION OF DEFINITIONS =================
At semicolon time, there are a number of things we have to fix up in
the definition we just compiled. Once we've compiled the prolog and
epilog and added the const_data, if any, the final location of the
code is known. We can then "finalize" the definition. This includes
resolving any EXITs and LEAVEs and putting in the correct offsets for
calls to other words (these couldn't be determined before the code's
final location was determined).
To handle these various things, we use some pseudo-instructions to stand in
place of the final instructions we're going to put in those same locations
at finalization time. To finalize, we look through the whole definition
for these pseudo-instructions, and take the appropriate action.
For the pseudo-instructions, we have to use opcodes that can't ever be
used for real instructions. Now we'll never use the lmw and stmw
instructions, since Keith D has warned me that they'll go away in
future! This means that we can use their values as pseudo-opcodes,
since we know that Mops will never generate them with their proper
meaning.
The values in question are primary opcodes 46 and 47, which means
$B8xxxxxx to $BFxxxxxx. We also define our handler codes into this
same range, since the two-byte handler code appears on an aligned
boundary. This will prevent a handler code ever being mistaken for an
instruction. Thus all our handler codes and pseudo-instructions
identify themselves, which generally makes life easier.
So far we've defined these:
BAxx xxxx call to a Mops word (xx xxxx is code-relative offset)
BBxx xxxx floating-point/AltiVec flag bytes. This code is basically
for the disassembler, since it's aligned and 3 flag bytes
are plenty. It shouldn't come up in finalization.
BCxx all handler codes which have no boilerplate code (can't
be EXECUTEd). Unlike on the 68k, xx is unsigned and not
doubled (so we can have 256 codes if we need them).
BDxx all other handler codes, except for colon defns. They can
be EXECUTEd.
BE00 handler code for a colon defn
BE01 ditto, but means this is a forward defn.
BE02 marks the start of :loc code
BE03 marks the start of :mloc code
BE04 handler code for a :ppc_proc defn
BE05 handler code for :entry
BE40 method (note BD40 is an inline method - so the 40 always
marks a method)
(we'll reserve BExx for any further options on colon definitions.)
These next two can't ever appear inside a definition, so we give an error
if they're encountered during finalization.
BF01 handler code for SYSCALL and EXTERN
BF0B handler code for LIBRARY
BF0C 0000 case-sensitive name for :ENTRY follows
BF02 0000 EXIT
BF03 xxxx conditional EXIT (xxxx is cond. branch opcode)
BF04 0000 LEAVE
BF05 0000 LOOP
BF06 0000 target of a forward defn. This marker is redundant,
but makes the decompiler output look more sensible.
BF08 xxxx unconditional branch. xxxx is offset (we only need 16 bits).
BF09 xxxx ELSE - branch. xxxx is initially the offset back to the
original conditional branch, in case we delete this branch
and need to adjust. Once the branch is resolved, and we
know it won't be deleted, xxxx becomes the branch offset
as for other unconditional branches. We can tell which
is which, since the first offset is negative and the
second positive.
BF0A replace with a literal load into r0 of the distance the code
is moved. Used in generating the addr of a location
within the current definition (since we don't know until
the end how far it might be moved).
==================== EXTERNAL CALLS ======================
CALL_EXTERN handles an external call. This requires that we set things up as
the PowerPC volume of IM says:
1. We have a pointer which is resolved by the CFM - this will
be the address of a transition vector. This pointer will be in the
data area (since it gets changed), and has a reloc addr pointing
to it in the code area, which belongs to the SYSCALL or EXTERN
word.
We have to allow for new external calls to be asked for, then
executed straight away, so we use a scheme where when we do an
external call, we check whether the pointer has been resolved
yet, and resolve it if it hasn't. We can easily tell, since we
initialize each pointer to nilP, which is an illegal address.
This test and the call to FindSymbol to resolve it, is in
get_transition_vector which is called at the beginning of our
external call sequence.
We could save a couple of instructions by pre-resolving symbols that
are already in the dictionary image, but it's not worth it - it's
better to use just one scheme, and we do need to be able to resolve
on demand, so that's the way we do it.
The transition vector has 2 addresses - the addr for us to branch to,
and the new RTOC value. The dest addr has to be loaded into the CTR
or the LR for us to use it as a branch target. We use the CTR - see
below for the reason for this. We want to load the dest addr as early
as possible so that instruction fetching won't stall, so we do this
part of the setup before we equalize the stack - during the equalization
nothing needs the CTR anyway.
We use r12 for the addr of the transition vector itself, as IM says.
This also won't get messed with during equalization.
We set up r12 and the CTR in get_transition_vector, as well as resolving
the symbol as described above. Factoring as much as possible into
get_transition_vector saves code space in the call sequence for
external calls.
So, as well as a bit of housekeeping, the main thing that CALL_EXTERN
does is to compile a call to get_transition_vector. CALL_EXTERN then
passes 1 as an "xt" to CALL_H. 1 can never be a real xt, since they must
be even, so this tells call_h that this is an external call. CALL_H
looks after everything from here on, including the stack equalization.
2. The first thing call_h does is pass 1 to EQUALIZE_FOR_CALL (in the
equalization section). This gets the parameters into the right regs (and
the parameter area, if necessary), as needed for external calls.
IM envisages that setting the SP is already done by the prolog
of the current routine, on behalf of all external calls that this
routine makes. The parameter area is big enough for the call with the
most parameters, and the others leave some unused space below the parm
area (actually higher in memory). The parm area for each call must come
immediately below the linkage area, so the callee can find it.
But in Mops we have a separate data stack pointer, so we simply
set up a linkage area for external calls using the system SP (gpr1) at
startup time, and never change it after that.
3. call_h then calls COMPILE_EXTERN_CALL (in this file, below) to compile
the actual call. To do this, we store our own RTOC into the linkage area
(actually this is done once and for all at startup since we have a
permanent frame for external calls), and load RTOC from the transition
vector (still pointed to by r12). We then bctrl (branch and link to count
register) to call the external code. (We could equally well have used
the LR - see below.)
Note: the standard sequence for cross-TOC calls in Metrowerks C is as
follows. We do much the same, but in a different order - in particular
we grab the dest addr and get it into the CTR as early as possible, before
we normalize the stack etc., and we move the SP to allocate the parm and
linkage areas on each call.
We could equally well have used the LR instead of the CTR. MW have to use
the CTR since they've done a bl to the out-of-line code, and have to preserve
the LR. But the IBM manual recommends using the CTR for computed branches
like this, to make life easier for debuggers etc, so that's what we'll do.
inline:
bl xxx
lwz r2/TOC, $14(r1/SP)
...
xxx lwz r12, <offs>(r2/TOC) / TOC entry is a pointer to transfer vector
stw r2/TOC, $14(r1/SP) / Save RTOC
lwz r0, (r12) / 1st entry in TV is destination addr
lwz r2/TOC, $4(r12) / 2nd entry is new TOC addr - put in RTOC
mtspr CTR, r0 / dest addr to CTR
bctr / branch there
=================== CLASS/OBJECT FORMATS ======================
=============== Object header ====================
Note if the obj is an ivar, it doesn't have a header if it's in a record,
unless the ivar is indexed. Indexed ivars always have headers, no matter
what, since the indexing code relies on it.
PPC notes: we have to make some minor changes to the object header format
for various reasons.
1. As objects live in the data area, we need a back pointer to the dic
entry in the code area, so methods like .ID: work.
2. The class pointer is a 4-byte relocatable address, and we want it to
be aligned.
3. The indexed length of the object now always occupies the 4 bytes at
the end of the indexed descriptor, whereas on the 68k, although we
allocated 4 bytes, we normally only took notice of the low 2 bytes.
We want the 4 bytes preceding the obj's data to look like a negative
indexed length if the object isn't indexed.
So we end up with this layout:
4 bytes Back pointer to cfa of obj's dic entry (relocatable).
Zero if no dic entry, or if no name field there.
4 bytes Class pointer (relocatable).
2 bytes Self relative offset to the class pointer.
For simple objects (i.e. not embedded), this is -4.
For embedded objects, it will be more negative. Note it
will always be negative.
2 bytes Self relative offset to the indexed area. If not indexed,
this will be 2, pointing to the next byte which is the first
data byte of the object. This means, if we erroneously try
to access the indexed area of a non-indexed object, we'll
get sent to the non-indexed data area, and try to interpret
the preceding 4 bytes as an "indexed length" (which is what
precedes a valid indexed area). But these 4 bytes are these
two offsets, which will always appear as a negative number.
If we're doing range checking, this will always trap.
We did a similar sneaky trick on the 68k, but the field
lengths/positions were different.
(object's data starts here)
This format means that when multiply inheriting, we need 4 bytes
separating each group of ivars ("embedded object"), not 2 as on
the 68k.
For indexed objects, the indexed area (after the ivars) is preceded by
the indexed descriptor (xdesc) with the following format. This format
is the same as on the 68k, except that it starts off-aligned, i.e.
the #elements field is 4-byte aligned.
2 bytes Width of indexed elements (in bytes)
4 bytes Number of elements minus 1 (i.e. LIMIT-1).
We can check this with a trap instruction.
============== Class dictionary entry ================
link/name/hndlr as for normal words - normal class hndlr is $BC1D,
and for imported classes, $BC2D.
Note that the offsets are defined in 2 places, which must agree! -
here, and in pNuc4.
2 bytes flags
(note - we're now aligned)
(offs 2) 32 bytes links to 8-way hashed method chains (relative)
(offs 34) 4 bytes link to ivar chain (relative)
(offs 38) 2 bytes padding for alignment of the n-way, and to put
the next fields in the same place as on the 68k,
which simplifies (findM).
(offs 40) 2 bytes non-indexed data length
(offs 42) 2 bytes width of indexed elements, or zero if not indexed
(offs 44) 2 bytes "xdispl offs" - the ivar offset where indexing starts
(used by large_obj_arrays), or zero if none.
(offs 46) 4(n+1) bytes
n-way to superclasses (n relocatable addrs terminated by zero)
Flag bits:
$0001 "large" - indexed with > 64K elements.
$0002 class is exported from a module
$0004 class is general
$00X0 if nonzero, an object of this class could go
in a register:
3 gpr
4 fpr
5 vector
$0n00 data must be aligned by 2**n. If zero, we use the
normal default alignment. (So far we only use this
for vectors, which must be 16-byte aligned.)
$8000 class is META (we use this to terminate NW_IVsetup)
Note: on the 68k, at the class cfa there was a call to BLD, the word
which built an object. These 4 bytes also served as a unique marker
identifying a class dic entry, since we did a JSR to BLD, and this
always had the same bit pattern. We treated the class handler code
the same as for a colon definition, and simply BSR'd to the cfa.
On the PPC we'll really use class_h, so that a class won't
look like a colon definition any more. This is a bit more logical
and shouldn't cause any problems. In any case, we couldn't use a
call to BLD as a unique marker, since calls are all self-relative
on the PPC so that calls to any particular word will have a different
bit pattern depending on where they are.
class_h is $BC1D, and will give an error if we try to EXECUTE a class
(as with all $BCxx codes). I doubt this will break any existing code.
Note also we've moved the "flags" from after the indexed width item,
up to be the first item. This is so the 4-byte items come out aligned
without sundry padding.
============== ivar dictionary entry ================
\ note: this format is the same on the PPC and 68k, for once!
4 bytes hashed name
4 bytes link to prev ivar dic entry (self-relative addr)
4 bytes class pointer (relocatable)
2 bytes offset of this ivar's data from the base addr of the class
2 bytes number of elements if indexed, or zero if not
2 bytes flags
Flag bits:
$0001 ivar gets an object header
$0002 this is a static ivar
$0004 this is a public ivar
$XXX0 this ivar is in a register (it's actually a temp object).
Reg bits are as for class flags above, in the lo 4 bits,
and the actual reg number in the next 5 bits. So far
the hi 3 bits are free (zero).
Note: although indexed objects can have 2^^32 elements, we are
assuming that an ivar can't have more than 64K elements. This is
because we are limiting the maximum ivar length of a class to 64K bytes,
which is a stricter condition. Would anybody want a longer ivar than
this??
============== Method dictionary entry ================
4 bytes hashed name
4 bytes link to prev method dic entry (self-relative addr)
2 bytes method flags
2 bytes 0 (alignment)
2 bytes $BE40 - "handler code" - not actually used as a handler, but
marks this as a method for decompiler etc. Note that an
inline method uses the code BD40, so I can just look for
40 in the low byte to see if it's a method.
2 bytes flag bytes as for colon defns, giving number of named parms etc.
This is the method's cfa (xt) here.
(method code follows)
Method flag bits:
$0001 private method (note other way round to ivars - we're using
1 for the unusual case)
$0080 there's a callFirst and/or callLast method
Note that the method code starts 6 bytes later than in the 68k version.
==========================================================
*)